Drain LibMR threads to a safe point before fork() (MOD-15307) by gabsow · Pull Request #97 · RedisGears/LibMR

gabsow · 2026-06-10T18:06:46Z

Summary

Adds MR_DrainForFork() / MR_ResumeAfterFork(). Called from the embedding module's pre-fork handler (main thread), MR_DrainForFork():

parks the event-loop thread at a between-tasks safe point via a posted task, and
bounded-waits the worker pool to idle,

so no LibMR thread holds a libc lock at fork() (which would ghost-lock the child). MR_ResumeAfterFork() releases the parked event-loop thread (called after the fork, on success or cancel).

Cooperative, bounded, fail-open. Deliberately not the existing mr_thpool_pause (SIGUSR2), which can freeze a worker mid-malloc holding the arena lock — exactly the ghost-lock this prevents. A worker still blocked acquiring the GIL is already malloc-safe, so a drain timeout there is benign.

Why / depends on

Fixes the RedisTimeSeries ASM-migration nightly hangs (MOD-14615 valgrind, MOD-14239 sanitizer). The embedding module wires this to redis core's new FORK_CHILD_PRE subevent (redis/redis#15327).

Pre-merge note

The two RedisModule_Log(..., "notice", ...) lines in MR_DrainForFork/MR_ResumeAfterFork should be downgraded to debug (kept at notice to confirm the drain fires during CI validation).

🤖 Generated with Claude Code

Note

Medium Risk
Fork-time threading and allocator safety are sensitive; incomplete drain can still leave ghost locks, though behavior is bounded and fail-open with warnings.

Overview
Adds MR_DrainForFork() and MR_ResumeAfterFork() so embedding modules can quiesce LibMR immediately before fork(). MR_DrainForFork() posts a task that parks the event-loop thread at a between-tasks safe point, then bounded-waits (2s) for that park and for the execution worker pool to go idle via new mr_thpool_wait_timeout(). If quiescence is incomplete, it logs a warning and continues (fail-open). MR_ResumeAfterFork() unblocks the parked event-loop thread and must be paired with every drain (after fork success or cancel).

This is intentionally not mr_thpool_pause (SIGUSR2), which can stop workers mid-malloc and worsen ghost-lock risk.

^{Reviewed by Cursor Bugbot for commit df7cc09. Bugbot is set up for automated code reviews on this repo. Configure here.}

Add MR_DrainForFork()/MR_ResumeAfterFork(). On the main thread (from the module's FORK_CHILD_PRE handler) park the event-loop thread at a between-tasks safe point via a posted task and bounded-wait the worker pool to idle, so no LibMR thread holds a libc lock at fork() (ghost-lock). Bounded + fail-open; cooperative (not the SIGUSR2 mr_thpool_pause, which can freeze a worker mid-malloc). Resume releases the parked event-loop thread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Logs the time spent quiescing the threads (excluding the fork that follows), plus whether the event-loop thread parked and the busy-worker count, so the pre-fork drain cost can be measured. Debug level so it is silent in production. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

When a multi-shard execution hits max-idle (EXECUTION_DEFAULT_MAX_IDLE_MS), the coordinator only logged a fixed "execution max idle reached" string with no clue which peer stalled. Now, at timeout, log the non-responding peer shard id(s) and endpoint(s), replies received vs expected, the elapsed wait, and time since last progress — at warning level so it is captured at the default loglevel (it coincides with the user-visible failure and is rate-bounded by nMaxIdleReached). Track responders by recording each ACK / NOTIFY_DONE sender node-id into a heap-strings set on the Execution (freed in MR_FreeExecution, the single final owner); pending = cluster peers minus that set, formatted by a new MR_ClusterFormatPendingPeers helper. Dispatch time is stamped and logged at debug. All new state is touched only on the event-loop thread, so no extra locking. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MR_DrainForFork only waited for mr_thpool_num_threads_working() to reach 0, ignoring jobs that are queued but not yet picked up by a worker. Such a job could be dequeued and start running (and allocating) immediately after the check, right as fork() runs -- defeating the drain. Add mr_thpool_num_jobs_in_queue() and wait for BOTH the working count and the queue to reach 0 (the same invariant thpool_wait uses). Addresses the bugbot "fork drain ignores queued jobs" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit f6fe687. Configure here.}

mr_thpool_num_jobs_in_queue() read jobqueue.len without holding jobqueue.rwmutex, while jobqueue_push/pull mutate it under that lock -- a data race that could make MR_DrainForFork see a stale (zero) queue and fork early. Take rwmutex for the read. Addresses the bugbot "unlocked job queue length read" finding. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gabsow · 2026-06-18T05:17:14Z

@galcohen-redislabs wanted your opinion on the last commit (995f842) before we lean on it. It takes jobqueue.rwmutex when reading jobqueue.len in mr_thpool_num_jobs_in_queue, to address the Bugbot "unlocked job-queue length read" finding.

Two things I am not sure about:

Is it really necessary? The neighbouring mr_thpool_num_threads_working() reads num_threads_working unlocked as well, and MR_DrainForFork polls both in a 1ms loop that is best-effort (it forks anyway on timeout) — so a stale len self-corrects on the next tick. By that logic the unlocked read is arguably benign (same as the existing one), and we could just reply to the bot instead of adding a lock.
Isn't it dangerous? It adds a lock acquisition on the main thread, in the pre-fork drain path. As far as I can tell it's safe — the lock is taken and released inside the accessor (never held across fork()), and workers only hold rwmutex briefly in jobqueue_push/jobqueue_pull without needing the GIL, so there's no GIL/lock inversion or fork-time deadlock. But I would rather get your read before relying on it.

Happy to drop the lock and instead annotate the read as a benign race (matching num_threads_working) if you think that is cleaner/safer. What do you think?

galcohen-redislabs · 2026-06-21T12:45:55Z

+ * `responded` set into "id(ip:port),..." in `out`, so a max-idle timeout can
+ * name the non-responding shard(s). Runs on the event-loop thread (same as the
+ * message handlers), so reading the cluster node table needs no extra locking.
+ * The node-id is the same NUL-terminated string used as the responded-set key. */


The whole comment is too specific: no need to specify the ticket number and no need to explain the implementation and cause for this specific issue.

The new diagnostic data of more explicit names (of workers that didn't respond after a timeout, etc.), including the needed locking mechanism and the monotonic clock, etc. are not really related to the core issue in this ticket (i.e., the forking coordination between the module and the redis core).

The whole thing is just a code slop and should not be included in this PR. If we really want such code (only if it is really a big issue, which honestly I don't think it is...) then we should open a specific ticket for that.

galcohen-redislabs · 2026-06-21T12:57:30Z

+        Node* n = mr_dictGetVal(entry);
+        if (n->isMe) continue;
+        if (responded && mr_dictFind(responded, n->id)) continue;
+        int w = snprintf(out + off, outLen - off, "%s%s(%s:%u)",


The size_t math is dangerous. As a rule of thumb you don't want to subtract two unsigned (for the fear of a wrap-around to large numbers).
Better to add an explicit guard here and change the comparison below to use additions only.

galcohen-redislabs · 2026-06-21T12:59:53Z


 size_t MR_ClusterGetSize();

+/* MOD-14615: format peer shards (id(ip:port)) that are NOT in `responded` into


Here too: no need to include ticket number or implementation details or reasoning for doing so or apologies about the struct declaration...

(also in all other places; we don't need comments explaining how and why code was written to solve a specific ticket)

galcohen-redislabs · 2026-06-21T14:24:37Z

+    REDISMODULE_NOT_USED(ctx);
+    pthread_mutex_lock(&mr_forkDrainLock);
+    mr_forkElParked = 1;
+    pthread_cond_broadcast(&mr_forkDrainCond);


Why not use the already-existing sync. functions in utils?

Also: the broadcasting back and forth between ml and worker threads (with its extra locks and hidden locks such that in mr_thpool_num_jobs_in_queue()) pause a threat of increasing latencies for all commands. I would really like a much simpler solution, if possible.

- Wait for the worker pool to drain via the pool's own bounded idle wait (new mr_thpool_wait_timeout, reusing thcount_lock/threads_all_idle) instead of a hand-rolled poll loop plus a per-read jobqueue lock. Removes the hot-path lock contention that risked added command latency. - Remove the max-idle "which shard didn't reply" diagnostics entirely (responded-sets, monotonic timestamps, dispatch/timeout logging, MR_ClusterFormatPendingPeers, mr_thpool_num_jobs_in_queue) -- out of scope for the fork-coordination fix; will be tracked in a separate ticket if needed. - Trim comments (no ticket numbers / implementation rationale). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gabsow · 2026-06-21T15:40:22Z

@galcohen-redislabs thanks for the thorough review — addressed all of it in df7cc09 (net −177/+42):

Removed the max-idle diagnostics entirely — the responded-sets, monotonic timestamps, dispatch/timeout logging, MR_ClusterFormatPendingPeers, and mr_thpool_num_jobs_in_queue. Agreed it's out of scope for the fork-coordination fix; tracked separately as MOD-16404 (your latency + size_t notes are captured there, and the original impl is referenced for whoever picks it up). Removing it also takes the size_t subtraction you flagged with it.
Simplified the drain to reuse existing sync — the worker wait is now a new mr_thpool_wait_timeout() that simply bounds the existing mr_thpool_wait on the pool's own thcount_lock/threads_all_idle. No hand-rolled poll loop, no per-read jobqueue lock, and no el↔worker broadcasting on the command path — so no added per-command latency. The whole drain is now: park the event-loop thread, then mr_thpool_wait_timeout(pool, 2s).
Trimmed the comments — dropped ticket numbers and the how/why-for-this-ticket explanations throughout.

PTAL.

This was referenced Jun 10, 2026

MOD-15307: drain LibMR threads before fork() (fixes MOD-14615 & MOD-14239) RedisTimeSeries/RedisTimeSeries#2055

Open

Add FORK_CHILD_PRE/CANCELLED subevents for pre-fork module coordination redis/redis#15327

Merged

gabsow and others added 2 commits June 11, 2026 13:20

gabsow marked this pull request as ready for review June 17, 2026 18:19

cursor Bot reviewed Jun 17, 2026

View reviewed changes

Comment thread src/mr.c Outdated

gabsow and others added 2 commits June 17, 2026 21:22

Merge branch 'master' into mod-15307-prefork-drain

03aa31e

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread src/mr.c Outdated

Comment thread src/utils/thpool.c Outdated

gabsow requested a review from galcohen-redislabs June 18, 2026 03:45

galcohen-redislabs reviewed Jun 21, 2026

View reviewed changes

gabsow requested a review from galcohen-redislabs June 21, 2026 16:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drain LibMR threads to a safe point before fork() (MOD-15307)#97

Drain LibMR threads to a safe point before fork() (MOD-15307)#97
gabsow wants to merge 7 commits into
masterfrom
mod-15307-prefork-drain

gabsow commented Jun 10, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

gabsow commented Jun 18, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Uh oh!

gabsow commented Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		size_t MR_ClusterGetSize();

		/* MOD-14615: format peer shards (id(ip:port)) that are NOT in `responded` into

Conversation

gabsow commented Jun 10, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why / depends on

Pre-merge note

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gabsow commented Jun 18, 2026

Uh oh!

galcohen-redislabs Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

galcohen-redislabs Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

galcohen-redislabs Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

galcohen-redislabs Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

galcohen-redislabs Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

galcohen-redislabs Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

galcohen-redislabs Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

gabsow commented Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gabsow commented Jun 10, 2026 •

edited by cursor Bot

Loading